Abstract: Interviewing potential employees is an essential component of the employment process. Many 
people have difficulty progressing through one-on-one interview sessions despite having performed 
exceptionally well in the earlier rounds of selection. A likely reason is that candidates do not 
conduct enough self-analysis of the facial expressions and degree of confidence they project during 
interviews. Candidates' technical, verbal, and logical skills are evaluated in a series of mock 
interviews; however, there are not enough resources available to help candidates prepare for the 
actual face-to-face interviews. Using a Convolutional Neural Network (CNN), the goal is to recognise and 
analyse the emotions expressed by candidates in order to determine their level of 
confidence. Eye blink rate is computed to detect anxiety, and eye gaze is tracked to 
detect distraction, so that the accuracy of the results can be improved. The 
results are then compiled and delivered to the candidate in the form of a report. This provides 
interview candidates with valuable information that can assist them in preparing for their 
one-on-one interviews. 


Keywords: HR interview, Convolutional Neural Network (CNN), Eye blink rate, Eye-gaze tracking, 
Emotion Recognition. 


INTRODUCTION 


The job interview is a necessary evil for everyone seeking gainful employment. There are 
typically three or four phases to a job interview, depending on the company's preferences, but 
the final phase is always an in-person meeting with the candidate [1]. Every interviewer reads 
between the lines of a candidate's body language, eye contact, and other behaviours to 
gauge how nervous or confident they are. Many job seekers prepare for the technical and 
logical components of an interview, but they don't put in the same amount of time to perfect 
their demeanour [2-7]. This is why some applicants breeze through the screening process but 
falter in the HR interview. To prepare for an interview, candidates in this situation should 
read up on interview analysis. Mock interviews are typically conducted by trained experts 
or career counsellors as practice for the real thing. Job seekers can learn a lot about how 
to handle tough interview questions by participating in mock interviews [8-12]. Candidates 
can benefit from a mock interview in a number of ways, including improving their 
communication skills, learning new interview methods, and reducing the anxiety they may 
feel before an actual job interview. Candidates who are attending their first interview can 
benefit greatly from participating in a mock interview [13-19]. 


Advanced online mock interview software plays recordings of simulated interviewers asking questions and 
requires candidates to give verbal responses. Some tools impose time limits to teach candidates to answer 
each question quickly but thoroughly [20-25]. Some apps record the interview so that it can be reviewed 
afterwards, while others provide the review in real time. Facial expression recognition is one of the most 
promising areas of research in human-computer interaction [26-31]. The ability to recognise a person in a 
photograph or video clip is known as facial recognition technology, and it has been around for some time. 
During the hiring process, AI is used to analyse a candidate's facial expressions to determine whether or 
not their personality is a good fit for the position, and whether or not they are being truthful in their 
replies [32-39]. 


There are advantages to this method of hiring for both the company and the candidate [40]. First, it 
allows candidates to take the interview whenever it is most convenient for them, and it allows recruiters to 
evaluate the data whenever it is most convenient for them. Second, unlike human recruiters, AI that has 
been correctly built does not have any implicit biases [41-45]. Third, it helps reduce the workload for 
human resources and the hiring manager. Fourth, the accessibility of this approach considerably widens the 
recruitment pool, allowing businesses to attract the top applicants from any background or economic 
situation. Since more and more businesses are using AI in their recruitment efforts, it is crucial that 
prospective employees have experience with the technology and a basic understanding of how it operates 
[46-49]. A comparable platform that facilitates practice interviews can be useful to job seekers. Those 
looking for work can use an interview simulation programme to get used to the 
pressure of an actual interview [50]. The proposed technique uses a Convolutional Neural Network (CNN) to detect 
the candidates' emotions during the practice interviews. Anxiety and distraction are inferred from the 
candidates' eye-blink rates and eye-tracking data. This provides them with a crucial means of preparation 
for the interview. A consolidated report of the findings is then made available to the user [51-62]. This 
report is useful for introspection and for learning from one's own difficulties. Candidates can 
take the mock interviews as many times as they like until they are satisfied with their reports and 
feel prepared for the real thing [63]. 


Literature Survey 


Aspiring candidates can benefit from emotion detection for interview analysis. The literature preceding 
our proposed work contains many publications that examine this general topic. 
Automated detection of emotional states has long been an active topic of study [64-69]. 
As a result, there have been numerous developments in this area. To facilitate facial emotion 
recognition, one study introduces an automatic, quick, and efficient method for detecting faces in an image. Images 
captured with a 7.1-megapixel Sony Cyber-shot digital camera and an emotion 
database were taken into account in its design. The method incorporates skin detection methods such as 
modified RGB, YCbCr, and HSV for enhanced performance. The experimental results demonstrate the 
effectiveness of the algorithm in identifying and locating human faces in images [70]. 
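
As a rough illustration of the kind of YCbCr skin-colour test that such methods build on, the sketch below converts an RGB pixel to YCbCr with the standard full-range BT.601 (JPEG) equations and checks its chrominance against a fixed range. The function names and the Cb/Cr thresholds (77 to 127 and 133 to 173, a range often quoted in the skin-detection literature) are assumptions chosen for illustration and are not taken from the reviewed study.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert 8-bit RGB to YCbCr using the full-range BT.601 (JPEG) equations."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def is_skin(r, g, b, cb_range=(77, 127), cr_range=(133, 173)):
    """Return True if the pixel's chrominance lies inside the assumed skin range."""
    _, cb, cr = rgb_to_ycbcr(r, g, b)
    return cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]


# Hypothetical sample pixels: a skin-like tone and pure blue.
skin_like = is_skin(224, 172, 105)
pure_blue = is_skin(0, 0, 255)
```

A real detector would apply such a test to every pixel and then clean up the resulting mask before drawing a bounding box; only the per-pixel rule is shown here.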


The face detection algorithm achieves better results thanks to its multi-stage nature: it 
first converts the image into the appropriate colour space (RGB, YCbCr, HSI), then detects the skin in the 
image, combines the skin-detected images, and finally draws a bounding box around the face 
region. Skin colour detection on adjusted versions of the RGB, YCbCr, and HSV colour spaces is used to 
identify people against a controlled background in this method. Results show that, for skin 
area classification, the YCbCr and HSI colour spaces are superior to RGB. Neither, however, was able to 
produce satisfactory outcomes. Frontal faces could be identified using the proposed hybrid approach. 
Recognising facial expressions is impossible without a frontal view. This strategy was chosen because it 
was thought to be the most effective for identifying and categorising facial expressions. It is highly 
accurate and reliable in identifying facial features and skin tones. In everyday conversation, accessories 
like glasses or a hat might obscure a portion of a person's face. In previous facial recognition 
investigations, once facial characteristics had been extracted from each image, a procedure was applied to 
fill in the gaps left by occluded features. However, the gaps are not always filled in with plausible 
information [71-75]. 
Because of this, it is challenging for robots to reliably identify emotions during real-time conversation. 
This is why the suggested emotion recognition method takes into account causal links between facial 
components even when the face is partially obscured. The relationships between the target attribute and 
the explanatory factors are used by Bayesian network classifiers to draw conclusions. Because of this 
property of the Bayesian network, the proposed system can identify emotional states without resorting to 
guesswork to fill in missing elements. Experiments showed that the proposed system was able to 
recognise the subject with a high degree of accuracy despite the fact that parts of their facial features were 
obscured [76-81]. 


Using a Bayesian network, the facial occlusion was taken into account by the emotion identification 
system. Using the suggested Bayesian network, we find robust associations between the various variables. 
Total recognition rates were high in the experiments, especially for the states of delight and surprise. 
More importantly, the experiment on emotion identification taking facial occlusion into account 
demonstrated that almost all recognition rates with the suggested method remained over 50%, and the 
errors were minor in comparison to the conventional method. These findings suggest that the system 
outperformed the status quo method, even when facial features were obscured, for identifying emotional 
states [82-93]. 


Version 2.0 of openSMILE enables multi-modal processing by integrating fundamental video features with 
established feature extraction paradigms for speech, music, and other sound events. The 
platform supports time synchronisation of parameters, online incremental processing, offline and batch 
processing, and the extraction of statistical functionals (feature summaries) such as moments, peaks, and 
regression parameters from audio and video descriptors. Once extracted, the features can be passed to 
statistical classifiers such as support vector machine models or exported to files for widely used toolkits 
such as Weka or HTK. The available low-level descriptors include conventional speech, music, and video 
features, such as Mel-frequency and related cepstral and spectral coefficients, Chroma, CENS, auditory 
model-based loudness, voice quality, local binary pattern, colour, and optical flow histograms. 
Face detection and pitch tracking are also supported. openSMILE is implemented in C++ and uses common 
open-source libraries for live audio and video input. It is fast, runs on 
both UNIX and Windows, and can be easily extended with plug-ins thanks to its modular, component-based 
architecture [94-99]. 


For exploratory feature generation in audio, the open-source openSMILE feature extractor continues to 
see vigorous development. Due to the core architecture's high degree of adaptability, several new features 
have been added since the 2010 release. Recent innovations include context-sensitive recurrent neural 
networks as a potent classifier and regressor, as well as multi-modal feature extraction through the 
incorporation of fundamental video features via OpenCV, enhanced audio descriptors such as 
psychoacoustic features, and a multi-loop mode opening up a plethora of possibilities for more complex 
multi-pass feature extraction procedures. To accommodate depth, colour pictures, and source audio 
following beamforming, a Microsoft Kinect input component was designed. Researchers in the field of 
computational paralinguistics have embraced openSMILE to the point where it has become a de facto 
standard reference toolset [100-101]. 


Several automated algorithms for analysing interview footage have emerged as large industries 
move toward using them in their hiring processes. However, up to this point, only elementary attempts at 
emotion analysis have been made, such as counting the number of smiles on an interviewee's face [102-109]. 
One study conducted tests on a custom-built acted interview corpus using a speech-based 
emotion recognition (SER) system and a commercially available facial expression analysis toolbox 
(FACET). Results hinted at FACET's potential for use in emotion recognition, but the benefits of the SER system 
were fairly limited. Video comments from job candidates were analysed using social signal processing (SSP) technology, which 
was then put to use in a variety of contexts like mentoring candidates and giving standard assessments for 
staff selection. Evaluation of Emotient's FACET toolkit to detect emotions and their usefulness in 
predicting the actors' performance was carried out using the video corpus recorded by asking actors to 
perform three types of scripts corresponding to different levels of composure competencies observed from 
call-center employees. FACET's automatic detection of emotion intensity provided some encouraging 
signs that it may be used to predict how well a candidate will do in an interview setting. Although there is 
a significant and active research community in the SER field, no commercially available SER 
technologies are robust and accurate enough to evaluate interview footage, in contrast to facial expression 
analysis. Speech emotion classifiers for four emotion categories (happiness, sadness, anger, and panic) were 
built on the LDC emotional speech corpus with open-source tools (openSMILE and WEKA), using MFCC-based 
features extracted with openSMILE. With only four 
actors and a narrow focus on one soft skill (calmness), the data set used in this initial study on creating 
computer tools to analyse interviewees' emotional behaviours is still rather small [110-117]. 


Another paper notes that facial expression is a form of non-verbal communication that nevertheless plays a crucial part in both verbal 
and non-verbal communication. It communicates what it is like to be human and what the person's 
mental state is. Human-computer interaction (HCI) is an area that has received considerable attention over 
the past few decades. The paper covers the basics of facial expression recognition, its uses, a survey 
of existing methods, and the many stages of an automatic facial expression recognition system. There has 
been a lot of work done over the past few decades by academics, businesses, and governments to develop 
more accurate ways of measuring honesty, deception, and credibility in interpersonal encounters. Every 
attempt has been made to capture a person's facial expressions. Since the face contains the majority of our 
sensory organs, we are able to express our emotions through facial expressions. Therefore, the 
expressions people make are taken seriously [118-125]. The study provided a concise overview of 
automatic emotion identification system methods, practical applications, and potential future research 
directions [126-131]. 


Features were calculated using coefficients defining aspects of face expressions recorded for six people. 
The features for the 3D face model have been computed. Features were sorted into categories. Excellent 
classification accuracy of emotions was reached in the studies for seven emotional states, at 96% for 
random division of data, and satisfactory classification accuracy was attained, at 73% for "natural" data 
division. This was the outcome for all users when using the MLP classifier and the "natural" data partition 
(subject-independent). All tests were conducted with the user in the same position in relation to the Kinect 
device. Users' performances of particular facial expressions did impact the classification accuracy. 
Classification accuracy can also be affected by a wide variety of other elements under real-world settings 
[132-139]. 


The brain's locus coeruleus-norepinephrine system regulates physiological arousal and attention, 
including the dilation of the pupils. Pupil dilation has been used as a measure of mental effort, neuronal 
gain, and task difficulty [140-147]. The spontaneous eye blink rate can 
provide insight into the cognitive processes that underlie learning and goal-directed behaviour because it 
correlates with dopamine levels in the brain. Together, gaze, pupil dilation, and blink rate are three 
non-invasive markers of cognition with high temporal resolution and well-understood neurological 
foundations. The work reviews the neurological bases of pupil dilation and blink rate, gives examples of 
their use, and discusses potential applications to studies of learning, cognitive development, and plasticity 
[148-151]. 


Children as young as three or four can learn to deceive their parents with surprising ease. As a result, several 
books and psychological studies have been written to aid in the interpretation of facial tells that 
indicate dishonesty. According to recent psychological studies, the rate of non-visual saccadic eye movements 
increases when people lie. One study presents a framework for automatic eye tracking and 
eye movement recognition and analysis that is both fast and accurate [152-159]. The proposed system tracks the iris and the four corners 
of the eye. In the offline analysis phase, the paths 
of these ocular features are analysed to identify and quantify a number of cues that can be used as indicators of 
deceit. The approach successfully locates the centre of the iris within the pupil in 91.47 percent of attempts, 
and blink localisation has a 99.3 percent success rate. To detect deception from blink rate, 
the normalised blink rate deviation was also proposed. Using this statistic together with a 
simple decision stump, the deceptive responses in the Silesian Face 
database were identified with a recognition rate of 96.15 percent. The eye characteristics reported by this 
automatic facial analysis method include iris centres, approximate gaze direction, blink 
intervals, and blink rate [160-171]. 


These automated systems could use data on a person's current mental state to develop 
individualised plans for interacting with and caring for that person. For instance, a more helpful tactic could be to 
acknowledge the patient's feelings and express sympathy when they express distress. 
Even outside robotics, computers that can recognise and respond to emotional states can greatly enhance 
the quality of human-computer interaction (HCI). HCI 
can be improved for the benefit of both humans and machines by being designed to mimic human 
interaction. The system was developed using OpenCV, and the Fisherfaces and PCA/LDA algorithms 
make up its core functionality. Additionally, the server was developed in Java to support 
the Android app [172-179]. 


Over the past decade, a great deal of study has been conducted in this area. Human-computer interaction 
relies heavily on non-verbal cues like facial expressions [180]. The CNN model presented in this study can 
identify emotional expressions in live video. The software could be used to analyse the emotions of viewers 
of movie trailers or educational videos. The method involves capturing a series of images with a webcam 
and then using a model to identify the subject's feelings. The CNN model trained on 
the FER2013 dataset yielded an accuracy of 0.6012 in the test and 0.8978 in the validation. Facial 
expression recognition has long been a problem area in human-computer interaction (HCI), despite its 
importance in deciphering human emotions. Ten emotions from the Amsterdam Dynamic Facial 
Expression Set-Bath Intensity Variations (ADFES-BIV) dataset were used in this study, and the model was 
evaluated on two different test datasets. No hand-crafted features were used in the 
proposed algorithm for DCNN-based emotion recognition in video. With only visual data evaluated, the 
model performed exceptionally well in identifying all ten emotions. The next step is to combine this 
method with other modalities, such as audio, and to extend it to new datasets 
[181-187]. 


The accuracy of facial expression identification can be improved in various ways, but one of the most 
effective is to modify the training framework and the preprocessing of images. One current issue 
is that the camera's ability to capture images at high speed may at times be 
compromised by lighting or other variables [188-191]. Facial expression recognition 
systems may become inaccurate if exposed to such variations. This issue was resolved by accounting for 
the variations in image attributes during high-speed capture, allowing the system to continue 
functioning normally while preserving recognition speed. Instead of using the current output as a 
benchmark, the proposed method averages over the prior image to speed up recognition. This approach 
is called the average weighting technique. It reduces the disruption caused by variations in visual 
features. The experimental results demonstrate that this strategy significantly improves the overall robustness 
and accuracy of facial expression recognition compared with 
using the convolutional neural network (CNN) alone [192-195]. 
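
One plausible reading of the average weighting idea is that each frame's class probabilities are averaged with those of the preceding frames before a label is chosen, so that a single frame disturbed by lighting does not flip the result. The sketch below illustrates that reading only; the window size, the function names, and the two-class example data are assumptions and do not come from the cited work.

```python
from collections import deque


def smooth_predictions(frame_probs, window=3):
    """Average each frame's class probabilities with up to window-1 earlier frames."""
    history = deque(maxlen=window)
    smoothed = []
    for probs in frame_probs:
        history.append(probs)
        n = len(history)
        smoothed.append([sum(p[i] for p in history) / n for i in range(len(probs))])
    return smoothed


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical per-frame outputs for two classes; the third frame is an outlier.
example_probs = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.9, 0.1]]
raw_labels = [argmax(p) for p in example_probs]
smooth_labels = [argmax(p) for p in smooth_predictions(example_probs)]
```

In this toy sequence the raw labels change on the outlier frame, while the averaged labels stay on the first class throughout.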


With an online mock interview system, students can practise for interviews at their convenience. Using 
images captured in real time, they can correct any issues that arise during the interviews. An accurate 
assessment of the interviewee's emotional state based on the context is essential for providing coaching 
during such practice. One study proposes multi-block deep learning to identify user 
emotions in self-management interview software [196]. Unlike the basic framework, which learns from 
whole-face images, the multi-block deep learning approach performs learning after sampling the primary 
facial areas (eyes, nose, mouth, etc.), which are crucial for 
emotion analysis. Multiple AdaBoost learning iterations are used in the sampling process for the multi-block 
procedure. Similarity measurement is also performed during this process for optimal block image 
screening and verification. The proposed system is evaluated against AlexNet, which is commonly 
used in facial recognition. The recognition rate and extraction time for the target area are compared against 
external standards. The proposed facial recognition method improved recognition rates by 3.75 percentage 
points and cut extraction times for the target area by 2.61 percent. The proposed deep learning method can 
be used to set up a systematic interview system that delivers high-quality, individualised interview 
instruction for job seekers [197-199]. 


The self-management interview system is a smartphone software designed to make interview preparation 
simple. The interview service records and uploads a live video to the server in real time. Emotional state 
is displayed using speech and facial recognition from the video at this moment, and real-time coaching is 
given accordingly. The proposed system is useful for speech and interview preparation since it features a 
variety of interview coaching content and self-diagnosis programmes. In contrast to the conventional 
framework for full-facial recognition, our system's deep learning approach to picture analysis aids the user 
in learning by sampling the essential areas that are crucial for sentiment analysis during face detection. 
Sampling in the multi-block process is carried out by a number of AdaBoosts. The primary traits that 
interfere with facial recognition are sampled, and then an XML classifier is used to recognise them at 
predetermined threshold levels. The CAS-PEAL face database is used to recognise eight emotions in the 
retrieved photographs (such as neutral, contempt, disgusted, angry, joyful, surprised, afraid, and fear), and 
services are made available via the programme. Since this is a mobile app, it depends on the applicant 
holding the phone at the correct angle to record the face, so any swaying could lead to inaccurate 
recognition. 


Methodology 


Requirement specification is the process of writing out the needs identified in an analysis. It involves 
determining what is needed to build the product, taking into account all 
probable user requirements. The system requirements are a more detailed definition of the user needs, 
setting out the system's services and constraints. In some cases, the specification acts as a binding agreement 
between the customer and the software developer. It also lists the hardware and 
software that must work together to complete a given operation. There is now a self-management 
interview App for mobile devices. The interview service records and uploads a live video to the server in 
real time. Emotional state is displayed using speech and facial recognition from the video at this moment, 
and real-time coaching is given accordingly. The proposed system is useful for speech and interview 
preparation since it features a variety of interview coaching content and self-diagnosis programmes. In 
contrast to the conventional framework for full-facial recognition, our system's deep learning approach to 
picture analysis aids the user in learning by sampling the essential areas that are crucial for sentiment 
analysis during face detection. Sampling in the multi-block process is carried out by a number of 
AdaBoosts. 


The primary traits that interfere with facial recognition are sampled, and then an XML classifier is used to 
recognise them at predetermined threshold levels. The CAS-PEAL face database is used to recognise 
eight emotions in the retrieved photographs (such as neutral, contempt, disgusted, angry, joyful, surprised, 
afraid, and fear), and services are made available via the programme. In addition, 
iMotions can be used to study human behaviour. It analyses human behaviour by 
means of facial expression analysis, eye tracking, electroencephalogram (EEG), and electrocardiogram 
(ECG), with the data being used for whatever purposes the vendor sees fit. These goods and services are 
available to everybody. Because the self-management interview app is a mobile 
application, candidates need to position their phones at just the right angle to record their faces, and 
any lapse in concentration could lead to inaccurate recognition. The expensive equipment 
needed to make use of the iMotions platform's sensors and wearable devices limits its utility. 
Unfortunately, such mock interview tools are an unaffordable luxury for many job seekers. 


Images from the FER2013 (Facial Expression Recognition 2013) image dataset, which are labelled as 
neutral, happy, sad, surprised, angry, disgusted, or fearful, are used in the proposed system. The dataset is 
first pre-processed in order to train a Convolutional Neural Network (CNN) detection model. 
First, frames of the candidate are captured during the simulated interview. The first module loads each image in the 
images folder into the detection model. The detection model analyses the emotions in each frame 
and returns the dominant emotion seen in that frame. The subsequent module also reads each image in the 
images folder. To detect anxiety, the facial landmarks are used to 
compute an EAR (Eye Aspect Ratio) value, from which the number of blinks is derived. Finally, 
eye-gaze tracking is carried out on the frames and a Gaze Direction vs. Frames graph is 
generated, which allows the candidate's ability to focus during the interview to be analysed. These findings 
are then compiled into a PDF report that shows the candidate their Emotion Levels vs. Frames graph, their 
number of eye blinks, and their Gaze Direction vs. Frames graph. 
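
The text does not spell out how the EAR is computed. A commonly used definition takes six landmarks p1 to p6 around one eye (p1 and p4 at the corners, p2 and p3 on the upper lid, p6 and p5 on the lower lid) and sets EAR = (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|). The sketch below assumes that definition and that ordering; the function name and the sample coordinates are hypothetical.

```python
import math


def eye_aspect_ratio(eye):
    """Compute the EAR from six (x, y) eye landmarks ordered p1..p6.

    Assumed ordering: p1 and p4 are the eye corners, p2 and p3 lie on the
    upper lid, and p6 and p5 lie on the lower lid below p2 and p3.
    """
    p1, p2, p3, p4, p5, p6 = eye
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    return vertical / (2.0 * horizontal)


# Hypothetical landmark coordinates for an open and a nearly closed eye.
open_eye = [(0, 0), (2, 2), (4, 2), (6, 0), (4, -2), (2, -2)]
closed_eye = [(0, 0), (2, 0.3), (4, 0.3), (6, 0), (4, -0.3), (2, -0.3)]
ear_open = eye_aspect_ratio(open_eye)
ear_closed = eye_aspect_ratio(closed_eye)
```

Because the ratio divides lid distances by the eye width, it is largely insensitive to how far the candidate sits from the camera, which is what makes a fixed threshold workable.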


System requirements can be met through design by outlining the system's structure, parts, modules, 
interfaces, and data. The design of the system describes the system's architecture, as well as its functions 
and the modules that make up the system. In what follows, you'll find specifics on how our proposed 
model is constructed. An in-depth understanding of a model's process and its responsibilities can be 
attained through an examination of its system architecture. 


Result 


Before the model is constructed, the images are processed and divided into training and testing sets. The CNN 
model is trained on the training data. The emotion detection model is simply this trained CNN 
model. Once the detection model is complete, the candidate can take the practice interview. 
Frame-by-frame images of his or her face during the simulated interview are saved in a designated folder. For 
each image in the images folder, the emotion detection model extracts the peak emotion 
and returns it. The top three emotions shown during the interview are listed in the final report alongside a 
graph of emotion levels versus frames. The CNN is a Sequential model, that is, a linear stack of layers. 
It is constructed by stacking convolution and pooling layers in alternation. There are four groups 
of layers in the detection stage. The first group includes a 2D convolution layer with 16 
kernels of size 7x7. The activations from the preceding layer are then normalised by 
batch normalisation to speed up training. 
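
The per-frame peak emotion and the top-three summary described above can be sketched as follows. This is a minimal illustration that assumes the detection model returns one score per emotion for each frame; the ordering of the EMOTIONS list, the function names, and the sample scores are hypothetical.

```python
from collections import Counter

# Label order is an assumption; it must match the order of the model's outputs.
EMOTIONS = ["angry", "disgusted", "fearful", "happy", "neutral", "sad", "surprised"]


def dominant_emotion(scores):
    """Return the label with the highest score for one frame."""
    index = max(range(len(scores)), key=scores.__getitem__)
    return EMOTIONS[index]


def top_emotions(per_frame_scores, k=3):
    """Count the dominant emotion of every frame and return the k most frequent."""
    counts = Counter(dominant_emotion(scores) for scores in per_frame_scores)
    return counts.most_common(k)


# Hypothetical model outputs for six frames.
sample_scores = [
    [0.05, 0.0, 0.05, 0.2, 0.6, 0.05, 0.05],
    [0.0, 0.0, 0.1, 0.7, 0.2, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.1, 0.5, 0.1, 0.1],
    [0.1, 0.0, 0.6, 0.0, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.2, 0.0, 0.0],
    [0.0, 0.1, 0.1, 0.1, 0.7, 0.0, 0.0],
]
report_summary = top_emotions(sample_scores)
```

The list of dominant labels in frame order is also what an emotion-versus-frames graph would be drawn from.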


The results are improved by repeatedly applying these two layers. After this comes an activation layer 
using ReLU. Finally, a dropout layer is added to guard against overfitting. 
This set is repeated three more times, with the number of filters increasing by one for each iteration and the 
size of the kernel decreasing by one for each set. 


The eye module reads the images from the 
images folder one by one. Its first component is the eye blink detector. The facial landmarks are located 
using the dlib library's pre-trained 68-point facial landmark shape predictor. While the eye is open, the EAR 
stays roughly constant. When the eye closes, the EAR drops significantly. If the ratio stays low for 
several consecutive frames, it is counted as a blink. Gaze tracking is the second component of the eye module. To begin, 
the candidate and webcam are used to calibrate the binarization threshold for the pupil 
detection method. Once the threshold has been determined, the candidate's facial landmarks are identified. 
The position of the pupil is estimated after the iris has been detected. Every time the pupil moves 
further than a predetermined distance from the image's centre, a running tally is incremented. These 
two measurements are used to analyse the candidate's nervousness and attention span during the interview. 
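
The blink rule and the gaze tally described above can be sketched as follows, assuming one EAR value and one pupil position are already available for each frame. The EAR threshold, the minimum run length, the distance limit, and all function names are illustrative assumptions rather than values from this work.

```python
import math


def count_blinks(ear_values, threshold=0.2, min_frames=2):
    """Count runs of at least min_frames consecutive frames with EAR below threshold."""
    blinks = 0
    run = 0
    for ear in ear_values:
        if ear < threshold:
            run += 1
        else:
            if run >= min_frames:
                blinks += 1
            run = 0
    if run >= min_frames:  # a blink that is still in progress at the last frame
        blinks += 1
    return blinks


def blinks_per_minute(blinks, n_frames, fps):
    """Convert a blink count over n_frames captured at fps into a per-minute rate."""
    return blinks * 60.0 * fps / n_frames


def gaze_deviation_percent(pupil_positions, centre, max_distance):
    """Percentage of frames in which the pupil is further than max_distance from centre."""
    off_centre = sum(1 for p in pupil_positions if math.dist(p, centre) > max_distance)
    return 100.0 * off_centre / len(pupil_positions)


# Hypothetical per-frame measurements.
sample_ears = [0.30, 0.31, 0.10, 0.09, 0.30, 0.29, 0.12, 0.30, 0.10, 0.10, 0.10, 0.32]
sample_pupils = [(50, 50), (52, 49), (80, 50), (50, 90)]
n_blinks = count_blinks(sample_ears)
rate = blinks_per_minute(n_blinks, n_frames=len(sample_ears), fps=2)
deviation = gaze_deviation_percent(sample_pupils, centre=(50, 50), max_distance=10)
```

Requiring several consecutive low frames filters out single-frame landmark jitter, which is why the isolated low value in the sample sequence is not counted.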


DISCUSSION 


It has been reported that, while talking, the average adult blinks 26 times per minute. One study found that
people's blink rates rose in response to feelings of excitement, fear, or irritation. The Gaze Direction vs.
Frames graph and the deviation percentage are included in the final report alongside the eye blink rate and the
candidate's corresponding emotional state. Implementation means putting a plan, idea, model, design,
specification, standard, method, or policy into practice. The purpose of the implementation process is to
produce a system component that meets the specifications set for that component during the design phase. The
input images are composed entirely of RGB pixels with values between 0 and 255. During normalisation, these
values are first scaled to the range from 0 to 1 and then rescaled to the range from -1 to 1.
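
The two-step normalisation can be sketched as follows. The text gives only the target ranges, so the second step is assumed here to be the linear map that sends 0 to -1 and 1 to 1; the function names are hypothetical.

```python
def normalise_pixel(value):
    """Map one channel value from 0..255 to -1..1 in two steps."""
    scaled = value / 255.0        # step 1: 0..255 -> 0..1
    return (scaled - 0.5) * 2.0   # step 2 (assumed linear map): 0..1 -> -1..1


def normalise_image(pixels):
    """Apply the mapping to a 2D grid of channel values (lists of rows)."""
    return [[normalise_pixel(v) for v in row] for row in pixels]
```

Centring the inputs around zero in this way is a common choice before feeding images to a CNN.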


We employ a CNN model that uses two sets of convolution and batch normalisation layers in sequence. The
filters, kernel size, name, input shape, and padding of each convolution layer are set explicitly. The purpose
of batch normalisation is to prevent excessive fluctuations in the values. Activation and dropout layers
follow. Four more sets of these layers are then added, doubling the number of filters in each set and
gradually reducing the kernel size while keeping the rest of the parameters constant. The last layer is the
Global Average Pooling layer, which reduces each feature map from the preceding layer to a single value. The
CNN model developed in the preceding section is trained using the training set of the pre-processed dataset.
After training is complete, we obtain a feature map for each of the seven emotions. Testing is performed
using the dataset's testing set. The trained CNN model can then be used for emotion detection.
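
As a rough illustration of the layer schedule and of the final pooling step, the sketch below works in plain Python rather than in a deep learning framework. The starting filter count (16), starting kernel size (7), kernel step (2), and minimum kernel size (3) are assumptions chosen for the example; the text states only that filters double and kernels shrink gradually.

```python
def conv_block_plan(num_sets=4, first_filters=16, first_kernel=7,
                    kernel_step=2, min_kernel=3):
    """List the filter count and kernel size for each repeated layer set.

    All default values are illustrative assumptions.
    """
    plan = []
    filters, kernel = first_filters, first_kernel
    for index in range(num_sets):
        plan.append({"set": index + 1,
                     "filters": filters,
                     "kernel_size": (kernel, kernel)})
        filters *= 2                                   # double the filters
        kernel = max(min_kernel, kernel - kernel_step)  # shrink the kernel
    return plan


def global_average_pool(feature_maps):
    """Reduce each 2D feature map (list of rows) to its mean value."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]
```

With one feature map per emotion, global average pooling yields one score per emotion, which is consistent with the seven feature maps mentioned above.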


The frames captured from the candidate's mock interview are saved in a folder. Each of these images is fed
individually into the previously trained CNN model. Using an array comparison, we compare each image to the
feature maps and calculate how strongly each expression appears in the image. Each image is then labelled
with its predominant emotion. The 68 facial landmarks in each image are identified using the pre-trained
shape predictor from the dlib library. The EAR values are computed from the facial landmarks around the
eye. If the EAR stays below its open-eye level for several consecutive frames, a blink is registered. For
gaze tracking, the optimal binarization threshold for the pupil detection method is first calibrated for the
particular candidate and webcam. The iris is then detected and the position of the pupil is estimated. Every
time the pupil moves further than a predetermined distance from the image's centre, a running tally is
incremented. A single PDF is generated with the combined results of these three modules. The first page
shows a chart of how the interviewee's emotional state changed over time, broken down by frame. The
second page shows the total number of blinks and the corresponding mood. The final page shows a graph of
Gaze Direction versus Frames, along with the deviation percentage.
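
The per-frame aggregation that feeds the report can be sketched as below. The emotion label list follows the seven classes of the widely used FER-2013 dataset and is an assumption, since the text does not name the seven emotions; the function names and the pixel-distance limit are likewise hypothetical.

```python
from collections import Counter
from math import dist

# Assumed label order; the study does not list its seven emotions.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def dominant_emotion(scores):
    """Return the label with the highest score for one frame."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return EMOTIONS[best]


def emotion_frequencies(scores_per_frame):
    """Count how often each emotion is the dominant one across frames."""
    return Counter(dominant_emotion(scores) for scores in scores_per_frame)


def deviation_percentage(pupil_positions, centre, max_distance):
    """Percentage of frames whose pupil lies further than `max_distance`
    from `centre` (both in pixels)."""
    if not pupil_positions:
        return 0.0
    off_centre = sum(1 for p in pupil_positions if dist(p, centre) > max_distance)
    return 100.0 * off_centre / len(pupil_positions)
```

The emotion counts would feed the first page of the report, and the deviation percentage the final page.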


CONCLUSION 


Every user of the system takes a simulated interview and receives a report on their
performance. There are three parts to this report. The first part shows a chart depicting the
candidate's emotional state over time, broken down by frame. The second part shows how many times
the candidate blinked and indicates whether or not they were nervous. Finally, a Gaze Direction vs. Frames
graph and the deviation percentage (the fraction of time the individual was not making eye contact) show
where the person's gaze was directed during the entire recording. Participants in the mock interview can use
these findings to conduct an honest evaluation of themselves and prepare more effectively for the real
thing. In summary, the system uses facial emotion detection, eye blink computation, and eye
gaze tracking to report on the candidate's performance in a simulated interview setting.
Individuals can use this report to reflect on and adjust their performance in upcoming interviews. In the
future, the candidate's voice could also be analysed by recording their speech during the simulated
interview. Combined with facial expression analysis and eye tracking, this would provide a more complete
picture of the candidate's demeanour.